Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces
Authors: S. Abbasi Zadeh and M. Ghadiri
Abstract
In many applications, such as web-based search, document summarization and facility location, the returned results should form a subset of documents that is both representative and diverse. The goal of this study is to select a bounded-size subset of good “quality” from a given set of items, while maintaining diversity with respect to a semi-metric distance function. This problem was first studied by Borodin et al. [1], but their proof relies throughout on the triangle inequality. In this modified proof we relax the triangle inequality and relate the approximation ratio of the max-sum diversification problem to the parameter of the relaxed triangle inequality, both in the normal form of the problem (i.e., a uniform matroid) and for an arbitrary matroid.

Introduction

In many search applications, the search engine has to guess the intended results from a given query; it is therefore important to deliver a diversified and representative set of documents to the user. Diversification can be viewed as a trade-off between having more relevant results and having more diverse results among the top results for a given query [3]. “Jaguar” is a clichéd example in the diversification literature [2, 4, 9], but it illustrates the point well, since it has several meanings, including a car, an animal and a football team. A set of good “quality” results should cover all of these meanings.

Borodin et al. [1] measure the quality of the results with a monotone submodular function and define diversity as the sum of distances between the selected objects. Since they assume that the distances are metric, they ask in their conclusion: for a relaxed version of the triangle inequality, can the approximation ratio be related to the parameter of the relaxed triangle inequality? In this study we answer this question. We call a distance that satisfies the relaxed triangle inequality a semi-metric.
A semi-metric distance on a set of items is like a metric distance, except that the triangle inequality is relaxed by a parameter $\alpha \ge 1$, i.e., $d(u, v) \le \alpha\,(d(v, w) + d(w, u))$. Answering this question makes the method applicable to algorithms that are defined on semi-metric spaces, e.g., [5, 7, 8]. IBM's Query by Image Content system is one of the best-known practical examples of semi-metric usage: its distance does not satisfy the triangle inequality [6].

By modifying the analysis of the algorithms proposed in [1], we show that these algorithms still achieve a $2\alpha$-approximation when there is no matroid constraint, and a $2\alpha$-approximation under an arbitrary matroid constraint. In other words, the modified analyses generalize the previous ones: for $\alpha = 1$ (i.e., a metric distance) they recover the previous approximation ratios.

Problem 1. Max-Sum Diversification
Let $U$ be the underlying ground set, and let $d(\cdot, \cdot)$ be a semi-metric distance function on $U$. The goal is to find a subset $S \subseteq U$ that

$$\text{maximizes } f(S) + \lambda \sum_{\{u,v\}:\, u,v \in S} d(u, v) \quad \text{subject to } |S| = p,$$

where $p$ is a given constant and $\lambda$ is a parameter specifying the trade-off between the distance and the submodular function. We give a $2\alpha$-approximation for this problem.

We first introduce our notation, following [1]. For any $S \subseteq U$, let $d(S) = \sum_{\{u,v\}:\, u,v \in S} d(u, v)$. For any two disjoint sets $S$ and $T$, let $d(S, T) = d(S \cup T) - d(S) - d(T)$. Let $\phi(S)$ denote the value of the objective function on $S$, and let $u$ be an element of $U - S$. The marginal gain of the distance function is $d_u(S) = \sum_{v \in S} d(u, v)$ and, similarly, the marginal gain of the weight function is $f_u(S) = f(S + u) - f(S)$. The total marginal gain is then $\phi_u(S) = f_u(S) + \lambda d_u(S)$. Finally, let $f'_u(S) = \frac{1}{2} f_u(S)$ and $\phi'_u(S) = f'_u(S) + \lambda d_u(S)$.
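The notation above can be made concrete with a small sketch. The three-item instance, the coverage-style quality function `coverage`, the dictionary `DIST` and all function names are illustrative assumptions, not taken from the paper; the distances are chosen so that the ordinary triangle inequality fails and the smallest valid relaxation parameter is $\alpha = 1.5$.

```python
from itertools import combinations, permutations

def pair_sum(S, d):
    """d(S): sum of d(u, v) over all unordered pairs {u, v} in S."""
    return sum(d[frozenset(pair)] for pair in combinations(sorted(S), 2))

def phi(S, f, d, lam):
    """Objective value phi(S) = f(S) + lam * d(S)."""
    return f(S) + lam * pair_sum(S, d)

def d_gain(u, S, d):
    """Marginal distance gain d_u(S): sum of d(u, v) over v in S."""
    return sum(d[frozenset((u, v))] for v in S)

def f_gain(u, S, f):
    """Marginal quality gain f_u(S) = f(S + u) - f(S)."""
    return f(S | {u}) - f(S)

def relaxation_parameter(U, d):
    """Smallest alpha >= 1 with d(u,v) <= alpha*(d(v,w) + d(w,u)) on all triples.
    Assumes strictly positive distances between distinct items."""
    ratios = [d[frozenset((u, v))] / (d[frozenset((v, w))] + d[frozenset((w, u))])
              for u, v, w in permutations(sorted(U), 3)]
    return max([1.0] + ratios)

# Hypothetical toy instance: each item covers some topics; f counts covered
# topics, which is a monotone submodular function.
TOPICS = {"a": {1, 2}, "b": {2, 3}, "c": {4}}

def coverage(S):
    return float(len(set().union(*(TOPICS[u] for u in S))))

# d(a, c) = 3 > d(a, b) + d(b, c) = 2, so d is not a metric.
DIST = {frozenset(("a", "b")): 1.0,
        frozenset(("a", "c")): 3.0,
        frozenset(("b", "c")): 1.0}

alpha = relaxation_parameter({"a", "b", "c"}, DIST)
value = phi({"a", "c"}, coverage, DIST, 0.5)
```

On this instance, adding `"c"` to `{"a"}` changes the objective by exactly $f_u(S) + \lambda d_u(S)$, which is the identity behind the definition of $\phi_u(S)$.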
Starting with an empty set $S$, the greedy algorithm (Algorithm 1) adds in each iteration an element $u$ from $U - S$ that maximizes $\phi'_u(S)$.

Lemma 1. Given a semi-metric distance function $d(\cdot, \cdot)$ satisfying the $\alpha$-relaxed triangle inequality, and two disjoint sets $X$ and $Y$, we have

$$\alpha(|X| - 1)\, d(X, Y) \ge |Y|\, d(X).$$

Algorithm 1 Greedy algorithm
  Input
    U: set of ground elements
    p: size of final set
  Output
    S: set of selected elements with size p
  S = ∅
  while |S| < p do
    find u ∈ U \ S maximizing φ′_u(S)
    S = S ∪ {u}
  end while
  return S

Proof. Consider $u, v \in X$ and an arbitrary $w \in Y$. We know that

$$\alpha\,(d(v, w) + d(w, u)) \ge d(u, v).$$

Summing over all $w \in Y$ gives

$$\alpha\,(d(\{v\}, Y) + d(\{u\}, Y)) \ge |Y|\, d(u, v),$$

and summing over all pairs $\{u, v\} \subseteq X$, in which each element of $X$ appears $|X| - 1$ times, gives

$$\alpha(|X| - 1)\, d(X, Y) \ge |Y|\, d(X).$$

Theorem 1. Algorithm 1 achieves a $2\alpha$-approximation for Problem 1 with an $\alpha$-relaxed distance $d(\cdot, \cdot)$ and a monotone submodular function $f$.

Proof. Let $G_i$ be the greedy solution at the end of step $i$, $i < p$, and let $G$ be the greedy solution at the end of the algorithm. Let $O$ be the optimal solution, and let $A = O \cap G_i$, $B = G_i \setminus A$ and $C = O \setminus A$. The algorithm clearly finds the optimal solution when $p = 1$, so we assume $p > 1$. We consider two cases: $|C| = 1$ and $|C| > 1$.

If $|C| = 1$, then $i = p - 1$. Let $C = \{v\}$ and let $u$ be the element that the algorithm takes in the next (last) step. Since $u$ is the greedy choice among all elements of $U \setminus G_i$, we have

$$\phi'_u(G_i) \ge \phi'_v(G_i), \quad \text{i.e.,} \quad f'_u(G_i) + \lambda d_u(G_i) \ge f'_v(G_i) + \lambda d_v(G_i),$$

and thus

$$\phi_u(G_i) = f_u(G_i) + \lambda d_u(G_i) \ge f'_u(G_i) + \lambda d_u(G_i) \ge f'_v(G_i) + \lambda d_v(G_i) \ge \tfrac{1}{2}\, \phi_v(G_i).$$

As a result, $\phi(G) \ge \frac{1}{2}\phi(O) \ge \frac{1}{2\alpha}\phi(O)$.

Now consider $|C| > 1$.
By Lemma 1, we have the following inequalities:

$$\alpha(|C| - 1)\, d(B, C) \ge |B|\, d(C) \quad (1)$$
$$\alpha(|C| - 1)\, d(A, C) \ge |A|\, d(C) \quad (2)$$
$$\alpha(|A| - 1)\, d(A, C) \ge |C|\, d(A) \quad (3)$$

$A$ and $C$ are disjoint and $A \cup C = O$; thus

$$d(A, C) + d(A) + d(C) = d(O). \quad (4)$$

Recall that $p > 1$ and $|C| > 1$ (the greedy algorithm finds the optimal solution when $p = 1$). The following multipliers are applied to (1), (2), (3) and (4), respectively:

$$\frac{1}{|C| - 1}, \quad \frac{|C| - |B|}{p(|C| - 1)}, \quad \frac{i}{p(p - 1)}, \quad \frac{i|C|}{\alpha p(p - 1)}.$$

Adding them, we have

$$d(B, C) + d(A, C) - d(A, C)\, \frac{i|C|\left(1 - \frac{1}{\alpha}\right)}{p(p - 1)} - d(C)\, \frac{i|C|(p - |C|)}{\alpha p(p - 1)(|C| - 1)} \ge d(O)\, \frac{i|C|}{\alpha p(p - 1)}.$$

Since $p > |C|$ and $\alpha \ge 1$,

$$d(A, C) + d(B, C) \ge d(O)\, \frac{i|C|}{\alpha p(p - 1)}.$$

Since $G_i = A \cup B$, the left-hand side equals $d(C, G_i)$. Substituting $x = \frac{1}{\alpha}$, so that $0 < x \le 1$, we get

$$d(C, G_i) \ge d(O)\, \frac{x\, i|C|}{p(p - 1)}.$$

From the submodularity of $f'(\cdot)$ we get

$$\sum_{v \in C} f'_v(G_i) \ge f'(C \cup G_i) - f'(G_i),$$

and the monotonicity of $f'(\cdot)$ implies that

$$f'(C \cup G_i) - f'(G_i) \ge f'(O) - f'(G).$$

Consequently,

$$\sum_{v \in C} f'_v(G_i) \ge f'(O) - f'(G).$$

Therefore

$$\sum_{v \in C} \phi'_v(G_i) = \sum_{v \in C} \left[f'_v(G_i) + \lambda\, d(\{v\}, G_i)\right] = \sum_{v \in C} f'_v(G_i) + \lambda\, d(C, G_i) \ge \left[f'(O) - f'(G)\right] + d(O)\, \frac{\lambda x\, i|C|}{p(p - 1)}.$$

Let $u_{i+1}$ be the element taken at step $i + 1$; then

$$\phi'_{u_{i+1}}(G_i) \ge \frac{1}{p}\left[f'(O) - f'(G)\right] + d(O)\, \frac{\lambda x\, i}{p(p - 1)}.$$

Summing over all $i$ from $0$ to $p - 1$, and using $\sum_{i=0}^{p-1} i = \frac{p(p - 1)}{2}$, we have

$$\phi'(G) = \sum_{i=0}^{p-1} \phi'_{u_{i+1}}(G_i) \ge \left[f'(O) - f'(G)\right] + \frac{\lambda x}{2}\, d(O).$$

Hence

$$f'(G) + \lambda\, d(G) \ge f'(O) - f'(G) + \frac{\lambda x}{2}\, d(O),$$

and

$$\phi(G) = f(G) + \lambda\, d(G) \ge \frac{1}{2}\left[f(O) + x \lambda\, d(O)\right] \ge \frac{x}{2}\left[f(O) + \lambda\, d(O)\right] = \frac{1}{2\alpha}\, \phi(O). \qquad \Box$$

Problem 2. Max-Sum Diversification for Matroids
Let $U$ be the underlying ground set, and let $\mathcal{F}$ be the set of independent subsets of $U$, so that $M = \langle U, \mathcal{F} \rangle$ is a matroid. Let $d(\cdot, \cdot)$ be a semi-metric distance function on $U$ and let $f(\cdot)$ be a non-negative monotone submodular set function measuring the weight of the subsets of $U$. The problem is to find a set $S \in \mathcal{F}$ that

$$\text{maximizes } f(S) + \lambda \sum_{\{u,v\}:\, u,v \in S} d(u, v),$$

where $\lambda$ is a parameter specifying the trade-off between the two objectives.
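For comparison with the matroid version just stated, the cardinality-constrained greedy procedure of Algorithm 1, analysed in Theorem 1, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the four-item instance, the coverage function and all names are assumptions, and the value $\alpha = 1.5$ was worked out by hand for this particular distance table. The greedy step uses the halved quality gain $f'_u(S) = \frac{1}{2} f_u(S)$, as in the analysis.

```python
from itertools import combinations

def pair_sum(S, d):
    """d(S): sum of d(u, v) over all unordered pairs {u, v} in S."""
    return sum(d[frozenset(pair)] for pair in combinations(sorted(S), 2))

def phi(S, f, d, lam):
    """Objective value phi(S) = f(S) + lam * d(S)."""
    return f(S) + lam * pair_sum(S, d)

def greedy(U, p, f, d, lam):
    """Repeatedly add the element maximizing phi'_u(S) = f_u(S)/2 + lam * d_u(S)."""
    S = set()
    while len(S) < p:
        def gain(u):
            half_f = 0.5 * (f(S | {u}) - f(S))
            return half_f + lam * sum(d[frozenset((u, v))] for v in S)
        # sorted() makes tie-breaking deterministic
        S.add(max(sorted(U - S), key=gain))
    return S

def brute_force(U, p, f, d, lam):
    """Optimal objective value over all subsets of size p (tiny instances only)."""
    return max(phi(set(T), f, d, lam) for T in combinations(sorted(U), p))

# Hypothetical toy instance (not from the paper).
TOPICS = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 2, 3}}

def coverage(S):
    return float(len(set().union(*(TOPICS[u] for u in S))))

U = set(TOPICS)
DIST = {frozenset(k): v for k, v in {
    ("a", "b"): 1.0, ("a", "c"): 3.0, ("a", "d"): 1.0,
    ("b", "c"): 1.0, ("b", "d"): 1.0, ("c", "d"): 2.0}.items()}
ALPHA = 1.5  # d(a, c) = 3 <= 1.5 * (d(a, b) + d(b, c)); assumed for this table

G = greedy(U, 3, coverage, DIST, 1.0)
opt = brute_force(U, 3, coverage, DIST, 1.0)
```

On this instance the greedy set happens to be optimal, so the $2\alpha$ guarantee holds with plenty of slack; the bound of Theorem 1 is a worst-case statement.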
Again, $\phi(S)$ is the value of the objective function. Because $\phi(\cdot)$ is monotone, $S$ should be a basis of the matroid $M$. We give a $2\alpha$-approximation for this problem.

Without loss of generality, we assume that the rank of the matroid is greater than one. Let

$$\{x, y\} = \arg\max_{\{x, y\} \in \mathcal{F}} \left[f(\{x, y\}) + \lambda\, d(x, y)\right].$$

We now consider the following local search algorithm:

Algorithm 2 Local search algorithm
  Input
    U: set of ground elements
    M = <U, F>: a matroid on U
    S: a basis of M containing both x and y
  Output
    S
  while there exist u ∈ U − S and v ∈ S such that S + u − v ∈ F and φ(S + u − v) > φ(S) do
    S = S + u − v
  end while
  return S

Theorem 2. Algorithm 2 achieves an approximation ratio of $2\alpha$ for max-sum diversification with a matroid constraint.

Since the algorithm is optimal when the rank of the matroid is two, we assume that the rank is greater than two. The notation is as before; $O$ and $S$ are the optimal solution and the solution at the end of the local search algorithm, respectively. Let $A = O \cap S$, $B = S - A$ and $C = O - A$. We use the following two lemmas from [1].

Lemma 2. For any two sets $X, Y \in \mathcal{F}$ with $|X| = |Y|$, there is a bijective mapping $g : X \to Y$ such that $X - x + g(x) \in \mathcal{F}$ for any $x \in X$.

Since both $S$ and $O$ are bases of the matroid, they have the same cardinality; consequently, $B$ and $C$ have the same cardinality, too. Let $g : B \to C$ be the bijective mapping obtained from Lemma 2, such that $S - b + g(b) \in \mathcal{F}$ for any $b \in B$. Let $B = \{b_1, b_2, \ldots, b_t\}$, and let $c_i = g(b_i)$ for all $i$. As claimed before, since the algorithm is optimal for $t = 1$, we assume $t \ge 2$.

Lemma 3. $\sum_{i=1}^{t} f(S - b_i + c_i) \ge (t - 2) f(S) + f(O)$.

We now prove two lemmas about the semi-metric distance function.

Lemma 4. If $t > 2$, then $\alpha\left(d(B, C) - \sum_{i=1}^{t} d(b_i, c_i)\right) \ge d(C)$.

Proof. For any $b_i, c_j, c_k$, we have $\alpha\,(d(b_i, c_j) + d(b_i, c_k)) \ge d(c_j, c_k)$.
Summing these inequalities over all $i, j, k$ with $i \ne j$, $i \ne k$, $j \ne k$, each $d(b_i, c_j)$ with $i \ne j$ is counted $(t - 2)$ times, and each $d(c_i, c_j)$ with $i \ne j$ is counted $(t - 2)$ times. Therefore

$$\alpha(t - 2)\left[d(B, C) - \sum_{i=1}^{t} d(b_i, c_i)\right] \ge (t - 2)\, d(C),$$

and the lemma follows.

Lemma 5. $\sum_{i=1}^{t} d(S - b_i + c_i) \ge (t - 2)\, d(S) + \frac{1}{\alpha}\, d(O)$.

Proof.
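Returning to Algorithm 2, the local search and the matroid constraint of Problem 2 can be sketched as below. This is only an illustration under stated assumptions: the matroid is taken to be a partition matroid (at most `CAP[g]` items from each group `g`), which is one common example and is not specified in the paper; the five-item instance, `is_independent`, and all other names are hypothetical. The starting basis `{b, c, d}` contains the pair `{c, d}`, which attains the best pair value on this instance, as the input of Algorithm 2 requires.

```python
from collections import Counter
from itertools import combinations

def pair_sum(S, d):
    """d(S): sum of d(u, v) over all unordered pairs {u, v} in S."""
    return sum(d[frozenset(pair)] for pair in combinations(sorted(S), 2))

def phi(S, f, d, lam):
    """Objective value phi(S) = f(S) + lam * d(S)."""
    return f(S) + lam * pair_sum(S, d)

def local_search(U, start, is_independent, f, d, lam):
    """Swap one element in and one out while the swap stays independent
    and strictly improves phi; stop at a local optimum."""
    S = set(start)
    improved = True
    while improved:
        improved = False
        for u in sorted(U - S):
            for v in sorted(S):
                T = (S - {v}) | {u}
                if is_independent(T) and phi(T, f, d, lam) > phi(S, f, d, lam):
                    S, improved = T, True
                    break
            if improved:
                break
    return S

# Hypothetical toy instance (not from the paper).
TOPICS = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 2, 3}, "e": {5}}

def coverage(S):
    return float(len(set().union(*(TOPICS[u] for u in S))))

DIST = {frozenset(k): v for k, v in {
    ("a", "b"): 1.0, ("a", "c"): 3.0, ("a", "d"): 1.0, ("a", "e"): 2.0,
    ("b", "c"): 1.0, ("b", "d"): 1.0, ("b", "e"): 2.0,
    ("c", "d"): 2.0, ("c", "e"): 1.0, ("d", "e"): 2.0}.items()}

# Assumed partition matroid of rank 3: one item from g1, two from g2.
GROUP = {"a": "g1", "b": "g1", "c": "g2", "d": "g2", "e": "g2"}
CAP = {"g1": 1, "g2": 2}

def is_independent(S):
    counts = Counter(GROUP[u] for u in S)
    return all(counts[g] <= CAP[g] for g in counts)

U = set(TOPICS)
result = local_search(U, {"b", "c", "d"}, is_independent, coverage, DIST, 1.0)
```

Each accepted swap strictly increases $\phi$, and there are finitely many bases, so the loop terminates. The number of iterations is not bounded polynomially by this sketch.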
Journal: CoRR
Volume: abs/1511.02402
Publication date: 2015